Conversation
soaxelbrooke left a comment:
I'd prefer to avoid changing the default tokenization behavior. I can see adding tok as an optional alternative in the Encoder constructor, though.
# [26, 108, 79, 104, 72, 24, 26, 117, 24, 9, 11, 8, 12, 10, 26, 90, 24, 26, 154, 56, 37, 149, 80, 169, 84, 24, 26, 156, 24]
# [25, 102, 76, 77, 68, 24, 25, 149, 24, 13, 10, 11, 12, 25, 79, 24, 25, 135, 58, 37, 152, 81, 160, 108, 24, 25, 143, 24]
print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
So using tok would change the default tokenization?

For context, in NLP tasks it can be important to retain the usage of contractions, since they can be informative about other aspects of the text's author.

@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than compensated for by the increased generalization power of having everything normalized :)! i.e. I'm open to further discussion. Meanwhile, how would you propose adding it as an option without also having the

@kootenpv I'm fine with people having different preferences on how they'd like those split up; my biggest concern here is changing the interface, which is a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose
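One backward-compatible option discussed above is accepting the tokenizer as an optional constructor argument, falling back to the current default when none is given. The following is a hypothetical sketch only: the Encoder constructor signature and the default_tokenizer helper are assumptions for illustration, not the library's actual API.

```python
from typing import Callable, List, Optional

def default_tokenizer(text: str) -> List[str]:
    # Stand-in for the existing default (NLTK-based) word tokenizer;
    # the real implementation differs.
    return text.lower().split()

class Encoder:
    """Sketch of exposing the word tokenizer as an optional argument
    without changing the default behavior for existing callers."""

    def __init__(self, vocab_size: int = 8192,
                 word_tokenizer: Optional[Callable[[str], List[str]]] = None):
        self.vocab_size = vocab_size
        # Fall back to the existing default when no tokenizer is supplied,
        # so dependent code sees no interface change.
        self.word_tokenizer = word_tokenizer or default_tokenizer

    def tokenize(self, text: str) -> List[str]:
        return self.word_tokenizer(text)
```

Callers who want contraction-preserving behavior could then pass their own callable (e.g. a thin wrapper around tok) while everyone else keeps the current default.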
The tokenizer makes for a better default (faster and saner).
Targets #10